148 ◾ Bioinformatics
conservation score, number of sequences, dbSNP accession, and the SIFT prediction of whether
the variant is tolerated or deleterious.
4.4.2 SnpEff
SnpEff [12] is another variant annotation tool. It categorizes variants by their genomic
locations, such as introns, untranslated regions (UTRs), upstream and downstream regions,
and splice sites, and predicts their coding effects. The predicted effects include
synonymous or nonsynonymous substitutions, start-gain, start-loss, stop-gain, and
stop-loss codons, and frameshifts.
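The effect categories listed above can be illustrated with a small Python sketch. It is not SnpEff's implementation; it is a minimal example, assuming the standard genetic code and a single affected codon, of how a codon change might be labeled. The function names classify_codon_change and classify_indel and the is_first_codon parameter are hypothetical and chosen only for this illustration.

```python
# Illustrative sketch only: not SnpEff code. Assumes the standard genetic code.
BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (first, second, third base).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon, is_first_codon=False):
    """Label the coding effect of replacing ref_codon with alt_codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    # Losing the ATG at the first codon of the coding sequence removes the start.
    if is_first_codon and ref_codon == "ATG" and alt_codon != "ATG":
        return "start-loss"
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "nonsynonymous"


def classify_indel(ref_allele, alt_allele):
    """Label an insertion or deletion in a coding region by its length change."""
    delta = abs(len(ref_allele) - len(alt_allele))
    if delta == 0:
        raise ValueError("alleles have equal length; not an insertion or deletion")
    # A length change that is not a multiple of three shifts the reading frame.
    return "frameshift" if delta % 3 else "in-frame"


if __name__ == "__main__":
    print(classify_codon_change("GAA", "GAG"))  # both codons encode glutamate
    print(classify_codon_change("TGG", "TGA"))  # tryptophan codon becomes a stop
    print(classify_indel("A", "AT"))            # one inserted base
```

A real annotator must also handle strand, exon boundaries, and multiple transcripts per gene, all of which this sketch ignores.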
In general, SnpEff consists of two main components: (i) database builds and (ii) variant
effect calculation. SnpEff database builds are usually distributed with SnpEff, and
hundreds of databases are available. A database build is a gzip-compressed serialized
object that is built from the genome FASTA sequence and an annotation file (in GTF or GFF
format). The sequence and annotation files can be acquired from database resources such as
ENSEMBL and UCSC. The variant effect calculation is performed after building the database.
It begins with building a data structure that is a hash table of interval trees indexed
by chromosome. The data structure indexes intervals and makes searching them efficient. The
SnpEff program takes a VCF file as input and finds the intersections with the annotated
database. The intersecting genomic regions are then identified, and the variant effect is
calculated from exonic regions only. Put simply, SnpEff takes information from the provided
annotation database and annotates the input VCF file by adding an ANN key to the INFO
field. The ANN subfields are separated by the pipe sign “|”, and the order of the subfields
is written in the VCF header. For example, variants may be categorized by SnpEff as SNP
(single-nucleotide polymorphism), Ins (insertion), Del (deletion), MNP (multiple-nucleotide
polymorphism), or MIXED (multiple-nucleotide and InDel). The impacts of
variants are classified as high, moderate, low, or modifier based on the affected region. A
variant has a high impact when it is disruptive and likely to cause protein truncation,
loss of function, or nonsense-mediated decay. High-impact variants include frameshift and
stop-gain variants. Non-disruptive variants that might only change protein effectiveness,
such as missense SNVs and in-frame deletions, are moderate impact
FIGURE 4.10 SIFT 4G annotation file.